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The cohesin protein complex contributes to transcriptional regulation in a CTCF-independent manner by colocalizing 
with master regulators at tissue-specific loci. The regulation of transcription involves the concerted action of multiple 
transcription factors (TFs) and cohesin's role in this context of combinatorial TF binding remains unexplored. To in- 
vestigate cohesin-non-CTCF (CNC) binding events in vivo we mapped cohesin and CTCF, as well as a collection of tissue- 
specific and ubiquitous transcriptional regulators using ChlP-seq in primary mouse liver. We observe a positive corre- 
lation between the number of distinct TFs bound and the presence of CNC sites. In contrast to regions of the genome 
where cohesin and CTCF colocalize, CNC sites coincide with the binding of master regulators and enhancer-markers and 
are significantly associated with liver-specific expressed genes. We also show that cohesin presence partially explains the 
commonly observed discrepancy between TF motif score and ChIP signal. Evidence from these statistical analyses in wild- 
type cells, and comparisons to maps of TF binding in ta/2/-cohesin haploinsufficient mouse liver, suggests that cohesin 
helps to stabilize large protein-DNA complexes. Finally, we observe that the presence of mirrored CTCF binding events at 
promoters and their nearby cohesin-bound enhancers is associated with elevated expression levels. 



[Supplemental material is available for this article.] 

The evolutionarily conserved cohesin protein complex plays an 
essential role in chromosome cohesion during mitosis and meiosis 
(Peters et al. 2008). The core of the complex is a heterodimer of 
structural maintenance of chromosome (SMC) subunits (SMC1A 
and SMC3) connected by a third subunit RAD21 (MCD1/SCC1 
in budding yeast), forming an unusual tripartite ring-like structure 
(Anderson et al. 2002; Haering et al. 2002). RAD21 is bound to 
a fourth member (either STAG1, STAG2, or STAG3) and it has been 
proposed that the complex mediates cohesion by embracing sister 
chromatids (Nasmyth and Haering 2009). Several other proteins 
are associated with cohesin, including NIPBL (Nipped-B in fly, Scc2 
in budding yeast), which is required for loading of cohesin onto 
chromatin (Rollins et al. 2004). 

Although essential for sister chromatid cohesion, Nipped-B 
was first identified in Drosophila melanogaster as a result of its 
function in gene regulation, where it was suggested to facilitate 
enhancer-promoter interactions (Rollins et al. 1999). Similarly, 
mutations in core components of the cohesin complex can affect 
gene expression, and have been linked to developmental defects 
in a number of different species (Donze et al. 1999; Benard et al. 
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2004; Krantz et al. 2004; Vega et al. 2005; Horsfield et al. 2007; 
Zhang et al. 2007; Pauli et al. 2008). Beyond its presence on sister 
chromatids during cell division, cohesin is also expressed in post- 
mitotic cells and is loaded onto unreplicated chromosomes in 
telophase (Sumara et al. 2000; Zhang et al. 2007; Wendt et al. 
2008). Together these findings point toward an important non- 
canonical role of cohesin in regulating gene expression. 

More recently, genome-wide maps of cohesin binding in 
mammalian cells reveal that the complex functionally associates 
with a large proportion of CTCF sites. Both repressor and acti- 
vator functions have been attributed to CTCF, but its role as an 
enhancer-blocking insulator is the most extensively studied, and 
cohesin plays a role in this function (Parelho et al. 2008; Rubio 
et al. 2008; Stedman et al. 2008; Wendt et al. 2008). Results from 
chromatin conformation capture (3C) experiments in the well- 
characterized H19/Igf2 imprinting control region (ICR) indicate 
that CTCF regulates allele-specific expression of the H19 and 
Igf2 genes by controlling intrachromosomal looping interactions 
(Murrell et al. 2004; Kurukuti et al. 2006; Yoon et al. 2007; Engel 
et al. 2008). Direct evidence from cohesin knockdown experiments 
implicates the complex in facilitating long-range interactions be- 
tween CTCF sites at these loci, as well as at others including the 
IFNG, apolipoprotein, and hemoglobin, beta genes (Hadjur et al. 
2009; Mishiro et al. 2009; Nativio et al. 2009; Hou et al. 2010). 

There is increasing evidence to suggest that changes in 
higher-order genome structure and subnuclear chromatin locali- 
zation are crucial for lineage specification and temporal/tissue- 
specific transcriptional regulation (Misteli 2007). In view of CTCF's 
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involvement in mediating chromatin loops at specific develop- 
mentally regulated genomic loci, it has been suggested that the 
primary role of CTCF may be the genome-wide organization of 
chromatin architecture (Phillips and Corces 2009). Globally, this 
hypothesis is supported by observations that CTCF-binding sites 
demarcate the borders of regions localized to the nuclear periphery 
(Guelen et al. 2008) and that they tend to correlate with intra- 
and interchromosomal interactions in the human genome as 
measured by Hi-C (Lieberman-Aiden et al. 2009; Botta et al. 2010). 
However, given the similarities in CTCF occupancy across differ- 
ent cell types (Cuddapah et al. 2009), it is not clear how CTCF 
alone could configure the three-dimensional structure of the ge- 
nome in a dynamic way. Considering that cohesin and CTCF 
colocalize genome wide, it is an attractive hypothesis that cohesin 
contributes to this global organizational function. Indeed, muta- 
tions causing human cohesinopathies such as Cornelia de Lange 
Syndrome severely disrupt the subnuclear organization of chro- 
matin and cause aberrant nucleolar morphology when induced in 
budding yeast (Gard et al. 2009). Furthermore, studies at specific 
loci show that cohesin dynamically controls the spatial confor- 
mation of chromatin required for normal development and dif- 
ferentiation, in a cell-division independent way (Hadjur et al. 2009; 
Seitan et al. 2011). 

Using ChlP-seq experiments in MCF-7 and HepG2 human 
cancer cells, we have recently shown that cohesin binds to thou- 
sands of sites in a CTCF-independent manner. In stark contrast to 
relatively invariant CTCF sites, these CNC-binding events differ 
dramatically between cell types. CNC sites colocalize with tissue- 
specific transcription factors (TFs), such as estrogen receptor alpha 
(ER) in MCF-7 cells, and contribute to global gene expression. 
Cohesin is also highly enriched at ER-bound regions that par- 
ticipate in interchromosomal looping interactions as assayed by 
ChlA-PET (Fullwood et al. 2009; Schmidt et al. 2010a). A study in 
mouse embryonic stem (ES) cells, which highlighted subunits of 
both the cohesin and mediator complexes as key contributors to 
ES cell state, found analogous patterns of CNC binding and co- 
occupancy with pluripotency regulators, such as POU5F1 (also 
known as OCT4), at interacting promoter and enhancer regions 
(Kagey et al. 2010). Finally, results from 3C experiments show that 
cohesin is required for similar promoter-enhancer interactions 
within the T-cell receptor alpha locus (Seitan et al. 201 1), and taken 
together with previous findings, firmly establish a role for the 
complex in widespread mediation of long-range transcriptional 
regulation. 

To investigate in vivo patterns of cohesin binding in depth — 
particularly independent of CTCF — we mapped both factors to- 
gether with a collection of 10 TFs using chromatin immunopre- 
cipitation experiments followed by high-throughput sequencing 
in primary mouse liver. We collected additional data from the 
same tissue for several histone modifications and other functional 
DNA-protein interactions, providing a comprehensive map of 
cohesin's role in tissue-specific transcriptional regulation. We 
show that this role is likely to be functionally similar across mul- 
tiple tissues by demonstrating that cohesin's presence at binding 
events of liver-specific TFs parallels its localization with ES cell- 
specific factors. We observe a positive correlation between the 
number of distinct TFs bound and cohesin presence, where most 
multiply bound ds-regulatory modules are CNC sites. In contrast 
to sites with CTCF, CNC sites tend to coincide with the binding of 
master regulators, including HNF4A, enhancer-markers such as 
EP300 (also known as p300), and are significantly associated with 
genes expressed in a liver-specific fashion. We also show that 



cohesin presence at least partially explains the commonly ob- 
served discrepancy between TF motif score and ChIP signal, sug- 
gesting a role for cohesin in stabilizing large protein-DNA com- 
plexes by enabling TFs to bind sequences less similar to the 
canonical binding site motif. Indeed, compared with wild-type 
mouse liver cells, ChIP signals in Rad21 -cohesin haploinufficient 
cells are preferentially diminished at binding events without high- 
scoring motifs. Finally, we identify cases where the presence of 
a cohesin-bound enhancer/CTCF pair is mirrored by the presence 
of CTCF near the putative target transcription start site (TSS) and 
observe differences in gene-expression levels that are associated 
with these consistent binding patterns. 

Results 

We performed ChlP-seq experiments in primary mouse liver with 
antibodies targeted to CTCF, three cohesin subunits (RAD21, STAG1, 
STAG2), 10 TFs (CEBPA, HNF4A, FOXA1, FOXA2, ONECUT1, HNF1A, 
PKNOX1, REST, GABPA, E2F4), two coactivators (EP300, CREBBP), 
five histone-modifications (H3K4mel, H3K4me3, H3K36me3, 
H3K79me2, H2AK5ac), and RNA polymerase II (RNAP2). See 
Methods for full experimental details. 

The TFs for our analysis were chosen to include both ubiq- 
uitously expressed factors and liver-specific regulators, two of 
which have well-characterized evolutionary dynamics (Schmidt 
et al. 2010b). We additionally profiled chromatin marks associated 
with active TSSs, enhancers, and transcribed genes, providing 
a comprehensive picture of the genome function and the tran- 
scriptional regulatory network active in mouse liver cells. Figure 1A 
displays a number of key regulatory features of the data in the vi- 
cinity of the predominantly liver-specific phosphoenolpyruvate 
carboxykinase 1 (Pckl) gene on mouse chromosome 2, including 
two clusters of TFs: one immediately proximal to the TSS and an- 
other —25 kb upstream of the TSS. Cohesin can be seen colocal- 
izing with CTCF as well as with clusters of TFs. These data are 
quantitatively and qualitatively comparable to other multifactor 
experiments in other tissues (Chen et al. 2008). 

After short read alignment with Burrows-Wheeler Alignment 
tool (BWA) (Li and Durbin 2009) and peak-calling with SWEmbl 
(Wilder et al., in prep.; see Methods), we determined the over- 
lap between sites bound by CTCF and the cohesin subunits. As 
expected, the three assayed cohesin subunits show highly similar 
patterns of binding with peaks of the RAD21 subunit coinciding 
with 99% and 94% of STAG1 and STAG2 peaks, respectively 
(Schmidt et al. 2010a). By defining cohesin presence as the oc- 
currence of at least one of its subunits, we find that cohesin 
colocalizes at the majority of CTCF sites (48,487; 87%), but is 
also present at a similar number of sites independently of CTCF 
(Fig. IB). We define this latter set of 46,471 cohesin-binding sites 
as cohesin-non-CTCF (CNC) sites (Supplemental Table SI). 

CTCF-independent cohesin binding is associated with master 
regulators and enhancers 

To determine the binding partners of cohesin at both cohesin- 
CTCF and CNC sites, we defined a set of putative ds-regulatory 
modules (CRMs) by grouping together CTCF and cohesin binding 
events with overlapping binding events of the 10 TFs and the 
coactivators EP300 and CREBBP (see Methods). The resulting CRMs 
have a median width of 449 bp, but vary in size depending on the 
number of factors present (SD = 346 bp; see Supplemental Table S3 
for all peak width statistics). The comparatively broad regions of 
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Figure 1 . (A) Genome-wide occupancy of cohesin, CTCF, tissue-specific and ubiquitous TFs in primary mouse liver as measured by ChlP-seq and shown 
near the Pckl gene. Cohesin colocalizes with CTCF as well as with clusters of transcription factors in the absence of CTCF, one of which can be seen 
overlapping the TSS of the Pckl gene. (B) Venn diagram showing CTCF and cohesin (RAD21, STAG1, STAG2) occurrence within CRMs. The pie charts 
indicate genomic locations of all CRMs (background), as well as those containing CTCF and CNC. The latter occur within promoter regions at a higher 
relative frequency compared with the other two classes. 



the genome associated with the histone modifications (H3K4mel, 
H3K4me3, H3K36me3, H3K79me2, H2AK5ac) and RNAP2 were 
not used to define the CRMs themselves. Instead, they were used to 
annotate the chromatin state of the CRMs post hoc (see Methods). 

Of the 223,890 CRMs that were identified, 43,458 (19.4%) are 
identified as CNC sites. These CRMs mostly occur away from an- 
notated TSSs (77%; Fig. IB) and tend to coincide with the binding 
of master regulators, such as HNF4A (69%), and the enhancer 
markers EP300 and/or CREBBP (66%). The fraction of promoter- 
proximal CNCs (23%) is nevertheless higher than that of CTCF 
(17%), and CNC-containing CRMs are significantly enriched for 
occurrence near TSSs when compared with all CRMs (Fisher's exact 
test P < 10~ 15 ; Fig. IB). CNC sites that occur within promoter 
regions (<2.5 kb from the annotated TSS), are highly enriched 
for RNAP2 binding compared with cohesin-bound promoters in 
general (Fisher's exact test P < 10~ 15 ). These results are similar to 
those in Drosophila, where cohesin lacks a functional interaction 
with CTCF (Bartkuhn et al. 2009), but is preferentially detected at 
promoters of active genes (Misulovin et al. 2007). Here cohesin 
selectively binds genes with paused RNAP2 and lacking H3K36me3, 
a mark associated with transcriptional elongation (Fay et al. 2011). 
Although we find that cohesin is associated with increased RNAP2 
pausing indices in mouse liver cells, cohesin-bound promoters are 
also associated with elevated expression levels and an enrichment 
of H3K36me3 within the gene body (Supplemental Fig. SI). 



At cohesin sites containing CTCF, we observe a shift in the 
summit positions of all cohesin subunits with respect to the CTCF 
summit position when the orientation of the CTCF motif is 
taken into account (Supplemental Fig. S2). This result is similar to 
recent reports for RAD21 (Nitzsche et al. 2011) and supports a di- 
rect and directional biochemical interaction between cohesin and 
CTCF. The same directional analysis at CNC sites, however, reveals 
that the position of cohesin is independent of the peak posi- 
tion and motif orientation of all other sequence-specific factors 
considered (Supplemental Fig. S2). This demonstrates a specific 
cohesin-CTCF interaction that is not seen at CNC sites and sug- 
gests a different mechanism of cohesin recruitment in the absence 
of CTCF. 

To identify the primary interacting partner proteins within 
the CRMs, we used the within-CRM ChIP fragment count (i.e., 
the number of mapped ChIP reads, extended to the estimated 
fragment length, overlapping the CRM) to measure the binding 
strength correlation between all ChlP-seq data sets. These corre- 
lations highlighted two separate modes of cohesin binding. First, 
a clear and distinct cluster includes all three cohesin subunits and 
CTCF (Fig. 2A, purple cluster). Second, cohesin subunits also cor- 
relate with tissue-specific factors including FOXA1, HNF4A, and 
HNF1A (green cluster); TFs are not correlated with CTCF. The 
cohesin/TF cluster is also marked by the active histone modifica- 
tions H3K4me3, H3K4mel, and H2AK5ac, as well as with the 
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Figure 2. Within-CRM binding correlations reveal distinct modes of 
cohesin binding in diverse cell types. The number of ChIP fragments 
(mapped reads extended to the estimated fragment length) overlapping 
a given CRM was used as a measure of binding strength for each data set. 
Factors were clustered along both axes based on the similarity in their 
colocalization profiles. (A) Heatmap visualization of all pairwise correla- 
tions between all ChlP-seq data sets in mouse liver cells illustrates cohesin 
subunits (RAD21, STAG1, STAG 2) clustered with CTCF. Cohesin also 
correlates with key tissue-specific TFs (FOXA1 , HNF4A, and HNF1A) in- 
dependently of CTCF as well as with histone modifications associated with 
transcriptional activity (H3K4me1, H3K4me3 / H2AK5ac) and coactivators 
(EP300 and CREBBP). (B) All pairwise correlations between previously 
published ChlP-seq data sets in mouse embryonic stem cells. Cohesin 
binding strength (SMC1 A, SMC3) correlates with CTCF while also forming 
a distinct cluster with key regulators of stem cell identity (POU5F1 , SOX2, 
NANOG, MYC), components of the mediator complex, as well as RNAP2. 
Similar results were obtained by performing the correlation analysis sep- 
arately on CRMs with CNC and CTCF (see Supplemental Fig. S3). 

coactivators EP300 and CREBBP, suggesting that cohesin within 
CNC sites may play a central role in active transcriptional regula- 
tion together with a wide range of TFs. 



To investigate whether the correlations between CTCF, co- 
hesin, and tissue-specific TFs are particular to liver or differentiated 
tissue, we performed the same analysis for a set of previously 
published ChlP-seq data sets from mouse embryonic stem (ES) 
cells (Chen et al. 2008; Marson et al. 2008; Seila et a. 2008; Kagey 
et al. 2010) (see Methods). Although the ES cell ChlP-seq data set 
contains a different collection of TFs and cohesin subunits, they 
show patterns highly similar to those observed in primary liver 
tissue (Fig. 2). Indeed, the SMC1A and SMC3 cohesin subunits 
correlate with CTCF (Fig. 2B, purple cluster) while also forming 
a separate, distinct cluster with key regulators of stem cell identity 
(POU5F1, SOX2, and NANOG), components of the mediator 
complex, and RNAP2 (Fig. 2B, green cluster). The cohesin loading 
factor is absent from the SMC1A/SMC3/CTCF cluster, which is 
consistent with previous observations of NIPBL's preferential as- 
sociation with CNC sites and supports the idea of a different 
mechanism of cohesin recruitment in the absence of CTCF (Kagey 
et al. 2010). Overall, cohesin shows two separate modes of binding 
that have minimal overlap in two transcriptionally divergent and 
phenotypically distinct mouse tissues: (1) either with CTCF and 
showing minimal signs of transcriptional activity, or (2) with 
clusters of tissue-specific TFs showing hallmarks of transcrip- 
tional activation. 



CNC sites occur preferentially at multiply bound c/5-reguIatory 
modules [CRMs] 

To understand the genomic properties of the identified CRMs, 
we grouped them into similar clusters based either on the nor- 
malized ChIP enrichment or binary presence/absence of the se- 
quence-specific factors using two different clustering methods 
(K-means and AutoClass) (see Methods). A primary difference be- 
tween these two clustering methods is that K-means requires the 
number of clusters to be defined a priori, whereas AutoClass uses 
a Bayesian probabilistic approach to automatically optimize the 
properties of each cluster (as well as the number of clusters) to 
achieve the best separation. Because the overall clustering results 
were similar between the two methods, we focused our analysis 
on the results from K-means (with K = 10) for ease of interpreta- 
tion (Fig. 3) (see Supplemental Fig. S4 for AutoClass results; Sup- 
plemental Fig. S5 for a justification of the choice of K). 

The 10 clusters, totaling 210,067 CRMs, are visualized in 
Figure 3, sorted from left to right by the fraction of CNC-con- 
taining CRMs in a given cluster. CRMs with CTCF form a large, 
distinct cluster at the extreme left (cluster 10; 41,368 CRMs). Most 
CRMs without CTCF fall into three large groups (clusters 7-9; 
102,091 CRMs) with an average of less than two sequence-specific 
factors (singleton CRMs). The remaining six clusters (66,608 CRMs) 
have increasing numbers of colocalizing TFs, with almost all pos- 
sessing either HNF4A, FOXA1, or FOXA2 (99%) and nearly half 
(48%) possessing all three of these factors. Furthermore, these six 
clusters all show a distinct pattern of chromatin state. For example, 
compared with singleton CRMs, clusters 1-3 are more strongly 
enriched for RNAP2, EP300/CREBBP, and H3K4mel (P < 10" 15 ), 
indicating that these clusters are likely to contain active enhancers. 

Ranking the CRM clusters by the proportion of CNC sites 
in each cluster, we observe a strong positive correlation between 
the average number of distinct TFs present and CNC presence 
(Spearman's p = 0.95, P < 10~ 15 ). In other words, CNC sites occur 
preferentially at multiply bound CRMs. The most highly bound 
cluster of CRMs (cluster 1) is also enriched for well-conserved 
TF-binding events (Schmidt et al. 2010b) compared with the 
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Figure 3. Cohesin-non-CTCF (CNC) binding occurs preferentially at multiply bound CRMs. (A) Results from K-means clustering (K = 1 0) of the binary 
presence/absence of ChlP-seq peaks corresponding to the 1 1 sequence-specific factors within 210,067 CRMs containing at least one of these factors. 
Factors were clustered based on the similarity in their binary occupancy profiles. The clusters were indexed and sorted by the proportion of CRMs with CNC 
in each cluster (increasing from left to right). (B) The binary presence/absence of ChlP-seq peaks for various chromatin features (non-sequence-specific 
factors and histone modifications) visualized according to the K-means results in A. Genomic location with respect to promoters (<2.5 kb from an 
annotated TSS), exons, introns, and gene distal regions, is also indicated. The proportion of CRMs with CNC sites in each cluster is indicated at the bottom 
(increasing from left to right). (C) Barplot indicating the mean number of distinct TFs within each CRM cluster. Bar widths correspond to the number of 
CRMs within each cluster. 



remaining clusters, including CEBPA-binding events shared in 
five species from chicken to human (Fisher's exact test P = 10~ 10 ) 
and HNF4A-binding events shared in human, mouse, and dog 
(Fisher's exact test P < 10~ 15 ). Taken together, these observations 
suggest a role for cohesin in integrating regulatory information 
from multiple TFs and stabilizing the binding of large multiprotein 
complexes to ds-regulatory sequences. 

CNC presence is associated with liver-specific gene expression 

Results from the unsupervised clustering analysis suggested that 
there might be a direct correlation between the number of TFs 
bound within a CRM, CNC presence, and the transcriptional ac- 
tivity of the genomic regions. By explicitly grouping CRMs into 
classes based purely on the number of distinct TFs present, we see 
that the proportion of CNC-containing CRMs significantly corre- 
lates with the number of bound TFs (Spearman's p = 0.89, P = 10~ 3 ), 
whereas CTCF shows no significant correlation (Spearman's p = 
0.49, P = 0.13). Indeed, almost two-thirds (62%) of highly occupied 
CRMs, defined as containing five or more TFs, possess CNC sites. 
The ratio of CNC- to CTCF-containing CRMs (CNC enrichment) is 
0.2 when zero TFs are present, but reaches a maximum of three- 
fold at seven TFs before returning to equivalence at 10 TFs (Fig. 4A). 
The proportion of promoter proximal CRMs (<2.5 kb from an 



annotated TSS) is also correlated with the number of distinct TFs 
present (Spearman's p = 0.95, P < 10" 15 ), but in contrast to CNC 
enrichment that peaks at seven TFs, the proportion of both RNAP2 
and H3K4me3 increase monotonically from 0 to 10 TFs (Fig. 4B). 
Other signs of transcriptionally active chromatin, such as the 
presence of the coactivators EP300/CREBBP, show a similar con- 
sistently increasing trend from one to 10 TFs (data not shown). 

We next asked how these CRM occupancy patterns may 
be related to transcriptional output by assigning CRMs to their 
nearest canonical TSSs and using mouse liver expression data 
obtained by replicate RNA-seq experiments (Kutter et al. 201 1) (see 
Methods). In addition, we identified 107 genes that are signifi- 
cantly up-regulated in mouse liver (Su et al. 2004). Median gene 
expression of CRM-associated genes increases when more than six 
TFs are present (Fig. 4B); however, only CRMs with between six 
and nine TFs are significantly enriched for the 107 genes signifi- 
cantly up-regulated in mouse liver cells (Fig. 4A) (Fisher's exact test 
P < 0.01) (Su et al. 2004). Strikingly, the peak of enrichment for 
tissue-specific genes at seven TFs coincides precisely with the peak 
in CNC enrichment at seven TFs. The three-way correspondence 
between liver-specific gene expression, CRM occupancy, and CNC 
sites, provides further evidence of cohesin's CTCF-independent 
transcriptional regulatory role at regions where multiple TFs as- 
semble to effect tissue-specific expression. 
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genes. However, due to the low number of CRMs with all 10 fac- 
tors, the confidence intervals for the expression value are large. 

The group of genes associated with these HOT regions in- 
cludes Polr2a, which encodes the largest subunit of the RNA 
polymerase II complex and Ccnll, a gene whose product (cyclin LI) 
participates in the regulation of the pre-mRNA splicing process 
(Supplemental Table S2; Dickinson et al. 2002). Another gene 
with a nearby HOT region, Grlfl, encodes a transcription factor 
that binds to the promoter region of the glucocorticoid receptor 
(Nr3cl), a gene that is expressed in almost all cell types (Adcock 
and Caramori 2001). 

These observations support the idea that HOT regions con- 
sist of constitutively open chromatin (Gerstein et al. 2010). Al- 
though the number of TFs in this study is limited, and many of 
those that were included have tissue-specific functions, these are 
the first HOT regions to be identified in vertebrates with similar 
properties to those described in the model organisms D. mela- 
nogaster and C. elegans. 




3 4 5 6 
#TFs per CRM 

Figure 4. CNC sites are associated with liver-specific gene expression. 
(A) Ratio of CNC-containing CRMs versus those with CTCF (log-fold 
change) for CRM classes with 0-1 0 TFs. Each class of CRMs was also tested 
for association with 107 genes signficantly up-regulated in mouse liver 
cells (see Methods). The significance of the association (negative-log- 
transformed Fisher's exact test P-values) are indicated. (*) P< 0.01; (**) P< 
0.001 . The enrichment of CNC-containing CRMs reaches threefold when 
seven TFs are present, and coincides with highly significant enrichment 
for an association with liver-specific gene expression for the same class. (B) 
CRMs with high numbers of colocalizing TFs are associated with increased 
promoter proximity (<2.5 kb from an annotated TSS) and characteristics 
of transcriptional activity (RNAP2 and H3K4me3 ChlP-seq peaks). Like- 
wise, the associated absolute gene expression value increases significantly 
with the number of bound TFs. Error bars indicate the 95% confidence 
interval of the median. 



Maximally occupied CRMs show similar properties 
to HOT regions 

A total of 34 CRMs contain all 10 assayed TFs. These regions have 
similar characteristics to recently identified high-occupancy target 
(HOT) regions (Moorman et al. 2006; Gerstein et al. 2010; Negre 
et al. 2011). These CRMs have high ChIP signal for all of the 
10 TFs, are highly enriched in promoter-proximal regions (Fisher's 
exact test P = 10~ 13 ), and are associated with genes having high 
absolute expression value — yet none of these are liver-specific 



Cohesin intensity explains disparities between motif score 
and ChIP signal 

The resolution of ChlP-seq data lends itself to the problem of 
finding TF-binding site motifs, as the actual binding site is typically 
within —50 bp of the peak summit. Nonetheless, the presence of 
the canonical motif usually explains only a fraction of the original 
ChlP-seq peaks (Valouev et al. 2008). Although the proportion 
of peaks with a motif match is dependent on the chosen score 
threshold, some ChlP-positive sequence regions have no recog- 
nizable similarity to the canonical motif (Johnson et al. 2007; 
Boyle et al. 2011). Furthermore, quantitative TF binding, as mea- 
sured by either ChlP-chip or ChlP-seq enrichment, is only weakly 
correlated with motif strength, as measured by the PWM log-odds 
score (Schmidt et al. 2010b; Wilczyhski and Furlong 2010). 

In order to investigate these phenomena, we asked whether 
there was an unexpected correlation between a given factor's 
motif score and another factor's ChIP signal within our identified 
CRMs. Briefly, for each sequence-specific factor, we first deter- 
mined the PWM score of the best motif match within each corre- 
sponding peak. We then compared this motif score with the ChIP 
signal of all other data sets within CRMs containing that peak. 
Similarly, motif score correlations were calculated for the oc- 
cupancy count (i.e., the number of distinct TFs present) and the 
distance to the nearest canonical TSS (Fig. 5A; Supplemental 
Figs. S6, S7). 

As expected due to their roles at the core promoter, high 
motif scores for both GABPA and E2F4 are most associated with 
H3K4me3 ChIP signal and tend to occur near to annotated TSSs 
(Conboy et al. 2007). In addition, CRM occupancy count is anti- 
correlated with motif scores of all factors except E2F4, indicating 
that when TFs occur in the absence of other potential binding 
partners, their binding is more likely to coincide with a high- 
scoring motif match. However, for only four out of the 11 se- 
quence-specific factors that we tested, the factor's motif score is 
most strongly correlated with its own ChIP signal. In fact, the 
strength of motif score for four different factors (ONECUT1, FOXA1, 
FOXA2, and HNF4A) is most strongly associated with HNF4A 
ChIP signal. 

Interestingly, cohesin ChIP signal is also anticorrelated with 
motif scores of all assayed factors except CTCF (Spearman's p = 
0.11) and E2F4 (Spearman's p = 0.13); in other words, stronger 
cohesin binding is associated with lower-quality motif matches for 
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Figure 5. Cohesin ChIP signal is significantly associated with TF motif score. (A) Cartoon heatmap representation of correlations between each se- 
quence-specific factor's motif score and the ChIP signal of all available ChlP-seq data sets. Correlations with CRM occupancy (number of distinct TFs 
present) and promoter proximity (distance to the nearest canonical TSS) are also shown. For each factor, the motif score correlation was calculated on the 
set of CRMs that contained a ChlP-seq peak for the same factor. Correlations with cohesin and coactivator ChIP signal were averaged over subunits 
(RAD21, STAG1, STAG2) and family members (EP300, CREBBP), respectively. Heatmap rows were ordered by increasing correlation with cohesin ChIP 
signal (from top to bottom). As a visual summary, only the top- and bottom-ranking correlations involving TFs are shown (see Supplemental Figs. S6, S7 for 
all correlations). (B) Increased cohesin ChIP signal at TF binding events without motifs. For each sequence-specific factor, the number of cohesin ChIP 
fragments within CRMs without high-scoring motifs was compared with that of CRMs with motifs. The 95% confidence intervals shown are based on 
a normal approximation of the Hodges-Lehmann estimate (median of all possible differences). 



co-bound TFs. The two exceptions to this rule, i.e., positive corre- 
lations with E2F4 and CTCF, are unsurprising since strong E2F4 
motifs and binding are found in highly occupied CRMs and CTCF 
binding has previously been shown to correlate well with motif 
quality, and there is evidence that CTCF recruits cohesin to sites 
where they co-occur (Parelho et al. 2008). For all other factors, 
stronger cohesin ChIP signals are associated with lower motif 
scores, particularly for the ONECUT1 motif (Spearman's p = -0.5) 
and CEBPA motif (Spearman's p = -0.38). 

We also compared levels of cohesin ChIP signal between ex- 
plicit groups of CRMs: those with and those without high-scoring 
motifs according to a minimum PWM score threshold. For all se- 
quence-specific factors except CTCF and E2F4, we observe higher 
levels of cohesin in the absence of high-scoring motifs (Fig. 5B). 

To determine whether cohesin presence could help to ex- 
plain the discrepancy between TF ChIP signal and motif score, we 



trained logistic regression classifiers to predict the presence of high- 
scoring motifs for each sequence-specific factor, with and without 
cohesin ChIP signal information. For ONECUT1, CEBPA, HNF1A, 
PKNOXI, FOXAI, FOXA2, REST, and E2F4, cohesin ChIP in- 
formation markedly improved the performance of the classifier. For 
GABPA, HNF4A, and CTCF there is minimal improvement in per- 
formance with the inclusion of cohesin in the model (Supplemental 
Fig. S8). These results suggest that cohesin presence is able to par- 
tially decouple ChIP signal from motif score for a significant number 
of TFs, including those that are often found at enhancer elements. 

ONECUT1 ChIP signal is reduced at weak motifs 
in heterozygous Rad21 +, ~ mouse liver cells 

In order to determine whether cohesin plays an active role in the 
binding of TFs to their target sequences, particularly in the absence 
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of high-scoring motifs, we used liver tissue from mice with only 
one functional allele of the Rad21 gene. Homozygous knockouts 
of Rad21 are lethal early in embryogenesis, suggesting that at least 
one wild-type Rad21 allele is essential for normal development in 
mammals. Although heterozygous Rad21 +/ ~ mice are viable, they 
possess a number of defects including hypersensitivity to ionizing 
radiation and impaired DNA repair capacity (Xu et al. 2010). To 
confirm that the level of cohesin binding is reduced, and to de- 
termine whether TF binding is consequently affected, we mapped 
RAD21, ONECUT1, CEBPA, and HNF4A in heterozygous Rad21 +I ~ 
mouse liver cells using ChlP-seq. 

The total number of binding events for all TFs is reduced in 
heterozygous Rad21 +I ~ cells (ONECUT1 45%, CEBPA 63%, HNF4A 
18%) and, as expected, the reduction is most severe for RAD21 
(14%). A total of 78,625 CRMs lose RAD21 binding according to 
the absence of an overlapping peak in heterozygous Rad21 +I ~ cells. 
We focus the remainder of our analysis on these sites. In terms 
of peak loss, CRMs without high-scoring motifs are enriched for 
binding events lost in heterozygous Rad21 +I ~ cells (responsive 
binding events) for all three assayed TFs (Fisher's exact test P < 
10~ 15 ; Supplemental Fig. S9). We also performed statistically ro- 
bust differential binding analysis on replicate ONECUT1 and 
CEBPA ChlP-seq data in order to determine ChIP signal differences 
between wild- type and heterozygous Rad21 +I ~ cells (see Methods). 
Similar to the peak-level analysis results, CRMs exhibiting signifi- 
cantly reduced ONECUT1 ChIP signal in mutant cells are enriched 



for ONECUT1 peaks without high-scoring motifs (Fisher's exact 
test P = 10~ 4 ; Fig. 6B). However, differential binding analysis for 
CEBPA revealed no significant differences in ChIP signal. 

Interestingly, promoter-proximal CRMs tend to be associated 
with both reduced ONECUT1 motif scores (Fig. 5A) and Rad21 +/ ~ 
responsive ONECUT1 binding events (Fisher's exact testP< 10~ 15 ). 
This suggests that cohesin may help to stabilize the binding of 
ONECUT1 near promoters in particular. One such region is shown 
in Figure 6 A overlapping the BC031353 promoter, where all but 
one of the remaining ONECUT1 -containing CRMs displayed re- 
tain ONECUT1 binding in heterozygous Rad21 +I ~ cells (resistant 
binding events). Note that a high-scoring motif is absent from the 
Rad21 +, ~ responsive ONECUT1 binding event overlapping the 
TSS, although the effect on BC031353 expression was not assessed. 

Mirrored binding of CTCF near transcription start sites 
and cohesin-bound enhancers are associated with elevated 
expression levels 

Cohesin has been shown to be crucial for two distinct types 
of chromatin interactions: (1) looping between individual CTCF 
binding events (Hadjur et al. 2009; Mishiro et al. 2009; Nativio 
et al. 2009; Hou et al. 2010), and (2) interactions between pro- 
moters and CNC-containing enhancers (Kagey et al. 2010; Schmidt 
et al. 2010a; Seitan et al. 2011). Reports of long-range chromatin 
looping mediated by CTCF have suggested that CTCF may influence 
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Figure 6. ONECUT1 ChlP-seq in heterozygous Rad21 +/ ~ mouse liver cells shows preferential loss of TF binding events where no motif is present. 
(A) Sample region near the BC031 353 gene showing overall reduction in RAD21 ChIP signal in heterozygous Rad21 +/ cells (responsive RAD21) and 
associated significant reduction in ONECUT1 ChIP signal within two CRMs (responsive ONECUT1). The ONECUT1 binding event overlapping the TSS 
contains no ONECUT1 motif. (B) WT ONECUT1 CRMs without motifs show a preferential decrease in ChIP signal (FDR < 0.1) in heterozygous Rod21 +/ ~ 
mouse liver cells (Fisher's exact test P = 10~ 4 ). Regions of interest (ROI) are those CRMs where RAD21 binding was ablated in heterozygous Rad21 +/ ~ 
mouse liver cells (responsive RAD21). 
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transcription by facilitating enhancer-promoter interactions 
(Handoko et al. 2011). In this model, interactions between pro- 
moter-proximal and distal CTCF binding events connect enhancers 
to their target genes by looping out the intervening DNA, thereby 
reducing the effective distance and increasing the probability of 
interactions between linearly distant genomic regulatory regions. 

We therefore searched for genes where this configuration has 
the potential to occur, i.e., genes with CTCF/cohesin binding 
events both nearby the TSS and proximal to their associated en- 
hancers (Fig. 7B). If these consistent binding patterns have bi- 
ological relevance, we expect their presence to be associated with 
increased expression levels of the corresponding genes. To test this, 
we first compiled a list of putative liver-specific enhancers, defined 
as CRMs >5 kb from their nearest canonical TSS that possess 
(1) a CNC site, (2) the liver master regulator HNF4A, (3) the EP300 
enhancer marker, and (4) the histone signature H3K4mel, but 
(5) not H3K4me3. We then assigned each identified enhancer to 
the nearest gene based on distance to the TSS, such that each en- 
hancer is assigned to only one gene. 
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Figure 7. Simultaneous CTCF binding within promoters and nearby enhancers is associated with 
elevated expression levels. (A) Violin plots showing gene expression distributions. Genes with CTCF 
binding events both within their promoters and nearby their associated enhancers show significantly 
elevated expression levels over those of the other three indicated classes (Mann-Whitney U-test P < 
10 3 ). (B) Sample region near the liver-expressed Agxt gene, where CTCF binds within the core pro 



moter, as well as near putative upstream cohesin-bound enhancers. Note that while CTCF is absent from 
the enhancers (CNC), it co-binds with HNF4A and EP300 within the Agxt promoter. 



Of the 5364 genes with nearby enhancers as defined above, 
532 genes have CTCF binding events both 2.5 kb from the TSS and 
nearby their enhancers (<2.5 kb). Indeed, these genes have sig- 
nificantly greater expression values than genes with CTCF binding 
near the TSS, but not nearby their enhancers, or vice versa (Fig. 7 A) 
(Mann- Whitney [/-test P < 10" 3 ). 

Discussion 

Cohesin has multiple vital functions in mammalian cells, in- 
cluding well-established roles in sister chromatid segregation 
in mitosis and meiosis. Recent results have implicated cohesin in 
the regulation of gene expression. Because cohesin has no known 
DNA-binding domain, the mechanism of this transcriptional 
regulation is assumed to arise from cohesin's ability to stabilize 
higher-order chromatin structure through interactions with 
chromatin organization proteins such as CTCF (Hadjur et al. 2009; 
Nativio et al. 2009). We and others have shown that cohesin plays 
a role in tissue-specific transcriptional regulation and that this 
role is at least partially characterized by 
CTCF-independent cohesin localization 
with master regulators in several tissues. 
To better understand cohesin's contri- 
bution to gene regulation, we collected 
genome-wide localization data of 10 TFs, 
several histone modifications and other 
functional DNA-protein interactions in 
primary mouse liver. These data provide 
a comprehensive map of cohesin's two 
known roles: one associated with CTCF 
and another CTCF-independent role in 
tissue-specific transcriptional regulation. 
We show that these roles are functionally 
similar across multiple tissues by dem- 
onstrating that cohesin's presence at 
binding events of liver-specific TFs mir- 
rors its localization with ES cell-specific 
factors. 

To further characterize cohesin's tis- 
sue-specific regulatory role, we focused 
on the properties of cohesin-non-CTCF 
(CNC) sites. By clustering the binding 
patterns of sequence-specific factors within 
CRMs and ranking these clusters by the 
fraction that overlap CNC sites, we dem- 
onstrate that CNC sites occur preferen- 
tially at CRMs containing multiple TFs 
and are less likely to be found at CRMs 
with singleton binding events that rep- 
resent the majority of regions bound by 
any given factor. The class of CRMs with 
the most TFs is also highly enriched for 
binding events that are persistent across 
hundreds of millions of years of evolu- 
tion (Schmidt et al. 2010b), suggesting 
that the conservation of these events — 
and possibly those of other tissue-specific 
TFs — is attributable to their highly bound 
state and putative functional context. 
However, only 5% of the CRMs in the 
maximally occupied cluster have a deeply 
shared binding event (i.e., five-species 
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CEBPA or three-species HNF4A). Thus, the colocalization of a 
large number of TFs does not mean, a priori, that a binding event 
will be invariant over evolutionary time. 

We observe a striking relationship between CNC enrich- 
ment and liver-specific gene expression for CRMs with submaxi- 
mal numbers of distinct TFs bound. In particular, CRMs with be- 
tween six and eight of our assayed TFs are more than twice as 
likely to possess CNC sites than CTCF, and are also the most highly 
enriched for liver-specific gene expression. Although both CNC 
enrichment and association with liver-specific expression peaks 
when CRMs have seven TFs, we find no evidence in the mouse 
ES cell data that this is a (just right) "goldilocks" number of TFs. 
However, our results in mouse liver are consistent with previ- 
ous research in other species, showing that regions with low-to- 
moderate numbers of transcription factors are most significantly 
enriched for annotated enhancers and signs of active transcrip- 
tional regulation (Negre et al. 2011). Therefore, although the dis- 
tribution of tissue-specific and ubiquitous factors is different in 
the ES cell experiments, this does not rule out the attractive hy- 
pothesis that a specific and relatively small number of TFs bind- 
ing together and stabilized by cohesin is a fundamental charac- 
teristic of mammalian tissue-specific gene regulation. 

Intriguingly, the most highly occupied CRMs containing all 
10 of our assayed TFs are neither associated with liver-specific 
genes nor CNC enrichment. Instead, these regions seem to be 
nearby constitutively active genes and have characteristics that 
are similar to recently described HOT regions (Moorman et al. 
2006; Gerstein et al. 2010; Negre et al. 2011). To our knowledge 
these are the first HOT regions to be described in vertebrates. 

The DNA sequence preferences of TFs are typically described 
using position weight matrices (PWMs), and referred to as binding- 
site motifs. These motifs remain a challenge to discover compu- 
tationally despite the large number of de novo motif discovery 
algorithms that have been developed to infer these sequence pref- 
erences (Nguyen and Androulakis 2009). Previous results have 
demonstrated that while ChlP-seq data is useful for identifying 
the specific regions of the genome bound by a given TF, there re- 
mains a subset of binding events with either weak or nonexistent 
motif matches. This lack of a clear relationship between DNA se- 
quence content and TF recruitment has been described as a result 
of indirect or cooperative binding, and recent approaches tailored 
specifically to ChlP-seq data have subsequently focused on finding 
these candidate cofactors (Bailey 2011). 

Using both computational and experimental methods, we 
show that the presence of cohesin likely explains the inverse re- 
lationship between ChIP signal and motif score observed for 
a number of our assayed factors. These TFs bind to stronger motifs 
in the absence of cohesin. Stated alternatively, we observe higher 
levels of cohesin in the absence of high-scoring motifs. These re- 
sults suggest that cohesin enables TFs to bind to suboptimal motif 
sequences either by stabilizing large protein-DNA complexes at 
highly occupied CRMs or by inducing binding through specific 
chromatin contortions. Importantly, we show that computational 
classifiers trained to predict high-scoring motif occurrence exhibit 
markedly improved performance when cohesin is incorporated 
into the model. Furthermore, using ChlP-seq in the livers of a 
Rad21 -cohesin haploinsufficient mouse model, we show that 
heterozygous loss of Rad21 results in the loss of 86% of RAD21 
binding events found in the wild type. This is accompanied by 
a reduction in ChlP-seq peak numbers for ONECUT1, CEBPA, and 
HNF4A that disproportionately affects binding events without 
high-scoring motifs for these TFs. Similarly, we find that sites both 



without RAD21 peaks and showing a significant loss of ONECUT1 
ChIP signal in heterozygous Rad21 +I ~ cells are also significantly 
depleted for high-scoring ONECUT1 motifs. Taken together with 
our observations in wild-type cells that cohesin is more abundant 
at highly occupied CRMs and at those without high-scoring motifs, 
these results point toward a role for cohesin in stabilizing the bind- 
ing of TFs to ris-regulatory sequences, particularly near promoters. 
Alternatively, expression level differences of the TFs themselves 
caused by the loss of cohesin may contribute to the overall reduction 
in binding events observed in heterozygous Rad21 +I ~ cells. 

Promoter regions are important sites of TF binding, where 
multiple regulatory signals are integrated to coordinate cell-type- 
specific expression programs. Both CTCF and cohesin have been 
shown to modulate chromatin structure in order to enable pro- 
moter-proximal factors to respond to signals from distant cis-reg- 
ulatory elements, such as enhancers. However, our results indicate 
that the majority of highly occupied CRMs, which show typical 
characteristics of enhancers, possess cohesin in the absence of 
CTCF (CNC sites). An attractive hypothesis is that CTCF may set 
up indirect chromatin interactions as the primary step toward 
enabling enhancer-promoter communications (Handoko et al. 
2011). We tested whether the dual presence of CTCF-binding 
events both nearby TSSs and their corresponding enhancers is as- 
sociated with increased expression levels. Using this simple ap- 
proach, we observe genome-wide patterns that support the model 
that concerted CTCF binding to linearly distant regulatory regions 
is associated with significantly elevated expression levels. Further 
investigations using 3C-based chromatin conformation assays 
would be needed to determine whether these patterns are indeed 
associated with functional chromatin looping interactions be- 
tween enhancers and promoters. 

Methods 
ChIP sequencing 

ChIP experiments were performed with wild-type primary mouse 
(C57BL/6 and/or C57BL/6xA/J) liver tissue and antibodies against 
CTCF (two replicates, two individuals; antibody: Upstate Bio- 
technology, 07729), STAG1 (three replicates, two individuals; anti- 
body: Abeam, ab4457), STAG2 (singlicate; antibody: Abeam, 4464), 
RAD21 (singlicate; antibody: Abeam, ab992), CEBPA (six replicates, 
two individuals; antibody: Santa Cruz Biotechnology, sc9314), 
HNF4A (two replicates, one individual; antibody: aviva systems bi- 
ology, ARP31946), FOXA1 (two replicates, two individuals; antibody: 
Abeam, ab5089), FOXA2 (four replicates, two individuals; antibody: 
Santa Cruz Biotechnology, sc6554), ONECUT1 (six replicates, two 
individuals; antibody: Santa Cruz Biotechnology, scl3050), HNF1A 
(three replicates, one individual; antibody: Santa Cruz Biotechnol- 
ogy, sc6547), PKNOX1 (singlicate; antibody: Santa Cruz Biotechnol- 
ogy, sc6245), REST (singlicate; antibody: Santa Cruz Biotechnology, 
sc25398), GABPA (two replicates, one individual; antibody: Santa 
Cruz Biotechnology, sc22810), E2F4 (singlicate; antibody: Santa 
Cruz Biotechnology, scl082), EP300 (two replicates, two individuals; 
antibody: Santa Cruz Biotechnology, sc585), CREBBP (singlicate; 
antibody: Santa Cruz Biotechnology, sc369), RNAP2 (two repli- 
cates, two individuals; antibody: Abeam, ab5408), H3K4mel (sin- 
glicate; antibody: Abeam, ab8895), H3K4me3 (singlicate; antibody: 
Abeam, ab8580), H3K36me3 (singlicate; antibody: Abeam, ab9050), 
H3K79me2 (singlicate; antibody: Abeam, ab3594) and H2AK5ac 
(singlicate; antibody: Abeam, 1 764) as recently described (Schmidt 
et al. 2009). Briefly, the immunoprecipitated DNA was end-repaired, 
A-tailed, ligated to the sequencing adapters, amplified by 18 cycles 
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of PCR, and size selected (200-300 bp) followed by single-end 
sequencing on an Illumina Genome Analyzer according to the 
manufacturer's recommendations. 

ChIP experiments were performed with heterozygous Rad21 +/ ~ 
primary mouse liver tissue and antibodies against RAD21 (two 
replicates, two individuals; antibody: Abeam, ab992), CEBPA (two 
replicates, two individuals; antibody: Santa Cruz Biotechnology, 
sc9314), HNF4A (two replicates, two individuals; antibody: aviva 
systems biology, ARP31946), ONECUT1 (two replicates, two in- 
dividuals; antibody: Santa Cruz Biotechnology, scl3050) as above. 

Read mapping and peak calling 

All ChIP sequencing reads from each replicate were aligned to the 
mouse reference genome assembly (NCBI37/mm9) using BWA 
(Li and Durbin 2009) with default parameters. After pooling rep- 
licate data for each factor/histone-modification, the reads were 
then filtered to remove low-quality mappings (phred-scaled map- 
ping quality <10), multiple reads mapping to the same genomic 
location and strand, as well as those mapping to the mitochondrial 
genome. Peaks were then called on all data sets using matched 
input data and a dynamic programming algorithm (SWEmbl) with 
-R 0.005 as recently described (Schmidt et al. 2010a). See Supple- 
mental Table S3. 

Cohesin-non-CTCF site definition and peak clustering 

Firstly, overlapping ChlP-seq peaks for CTCF and the cohesin 
subunits (STAG1, STAG2, RAD21) were merged to form a set of 
disjoint genomic regions. Our definition of cohesin-non-CTCF 
(CNC) sites required the presence of at least one cohesin subunit 
peak and the absence of CTCF. In order to obtain a high-confidence 
set of CNC sites, in the absence of significant CTCF ChIP enrich- 
ment that may have escaped peak-detection, we required that 
these sites also satisfied the following criterion: log((norm_CTCF_ 
ChIP)/(norm_Input))<0.68. This cut-off corresponds to the fifth 
percentile of ChIP enrichment scores within CTCF peaks. 

Overlapping peak regions of the sequence-specific factors 
(CTCF, CEBPA, HNF4A, FOXA1, FOXA2, ONECUT1, HNF1A, 
PKNOX1, REST, GABPA, E2F4), as well as cohesin (STAG1, STAG2, 
RAD21), CNC sites, and the coactivators EP300/CREBBP, were 
merged to define putative ds-regulatory modules (CRMs) (Zinzen 
et al. 2009). A single-linkage clustering approach was used, where 
a peak overlap of >1 bp with at least one other peak within 
a CRM is sufficient for membership within the CRM. The pres- 
ence or absence of a particular histone-modiheation (H3K4mel, 
H3K4me3, H3K36me3, H3K79me2, H2AK5ac) or RNAP2 binding 
within a CRM was then determined post hoc by satisfaction of either 
of the following criteria: (1) the presence of an overlapping peak, or 
(2) ChIP enrichment within the entire CRM region of at least 
threefold, where the number of ChIP reads overlapping the CRM>8. 

Motif analysis and selection 

We used MEME (Bailey and Elkan 1994) and NestedMica (Down 
and Hubbard 2005) to perform de novo motif discovery for each 
sequence-specific factor using peak regions with the top 1000 
scores. In each case, 50 bp of DNA sequence surrounding the 
SWEmbl summit was used to find five frequently occurring se- 
quence motifs up to 25 bp in length (MEME parameters: -nmotifs 
5 -maxw 25 -revcomp; NestedMica parameters: -numMotifs 
5 -minLength 5 -maxLength 25 -revComp -backgroundOrder 
1 -backgroundClasses 4). We scanned all bound (positive) re- 
gions for each factor with PWMs for all five NestedMica motifs 
as well as the top-scoring MEME motif to determine the score of 



the best motif match in each case. We repeated this using equally 
sized unbound (negative) regions, which were randomly sampled 
from the repeat- and exon-masked genome. The optimal motif 
for each factor, which was retained for further analysis, was defined 
as that best able to discriminate between positive and negative 
regions according to the AUC (area under ROC curve) performance 
measure. 

Mouse embryonic stem cell data analysis 

Publicly-available ChlP-seq data sets from mouse embryonic stem 
cells were downloaded, reprocessed, and analyzed using a similar 
procedure to that described above: CTCF, MYC, ESRRB, KLF4, 
MYCN, SMAD1, STAT3, TCFCP2L1, ZFX, EP300, SUZ12 (Chen 
et al. 2008), NANOG, POU5F1, SOX2, H3K79me2 (Marson et al. 
2008), RNAP2 (Sella et al. 2008), NIPBL, SMC1A, SMC3, MED1, 
MED 12 (Kagey et al. 2010). See Supplemental Table S3. 

CRM clustering and analysis 

To restrict our analysis to sites with possible patterns of combina- 
torial TF binding, we filtered our data to retain only CRMs con- 
taining a binding event of at least one sequence-specific factor. 
We used two independent methods (K-means and AutoClass) to 
group CRMs into similar clusters. 

(1) We used K-means to group CRMs into K similar clusters based 
on the binary presence/absence of the 11 sequence-specific 
factors within each CRM. In order to choose an appropriate 
value for K, we ran the clustering algorithm on a random subset 
of 20,000 CRMs and determined the median within-cluster 
sum of squares (WCSS) over 10 replicates of each value of K in 
the range [2-50]. The WCSS tends to decrease as the number 
of clusters K increases, but the decrease flattens slightly for 
values of K near 10 (see Supplemental Fig. S5). We used this 
"elbow" method to choose a value of K = 10 when running the 
algorithm on the entire data set. 

(2) We used AutoClass (Cheeseman and Stutz 1996) to group CRMs 
into similar classes based on the normalized ChIP enrichment 
of the 11 sequence-specific factors within each CRM. AutoClass 
uses a Bayesian probabilistic approach to automatically opti- 
mize the properties of each class (as well as the number of 
classes) to achieve the best separation. An advantage of this 
"fuzzy" clustering approach, not provided by other traditional 
clustering methods such as K-means, is the availability of 
a measure (posterior probability) to assess the confidence that 
each CRM belongs to its assigned class. The AutoClass C 
command-line program was used with the following primary 
settings: (1) data model: single_normal_cn (factor ChIP en- 
richments follow conditionally independent normal variables); 
(2) convergence criterion: converge_3 (most stringent); (3) 
initial values for the number of class: 2, 3, 5, 7, 10, 15, 25, 35, 45, 
55, 65, 75, 85, 95, 105; (4) absolute error of input data: 10% for 
all factors. We filtered the CRMs in the resulting classification 
to retain only those with posterior probabilities >0.5 and 
combined classes with high correlation (Spearman's p > 0.9) 
between their median ChIP enrichment profiles. 

Other CRM attributes such as the ChIP peak presence/absence 
of other factors/histone-modifications not used in the original 
clustering were added to aid visualization of the clustering results. 
Gene annotation information from Ensembl version 60 (Flicek 
et al. 2010) was used to add genomic localization information for 
each CRM, where "Promoter" was defined as occurring <2.5 kb 
from an annotated TSS, "Exon" corresponds to overlap with an 
exon but not a promoter, "Intron" corresponds to overlap with 
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a gene but neither an exon nor a promoter, and "Distal" for lo- 
calization elsewhere. Gene annotation information for pseudo- 
genes was ignored throughout the analysis. 

Expression analysis 

We used previously published RNA-seq data from mouse liver to 
obtain absolute expression estimates for all genes (Kutter et al. 
2011). Briefly the raw reads were truncated to 35-mers and aligned 
to mouse transcript sequences (cDNA sequences from Ensembl 
version 60, NCBI37/mm9) using Bowtie version 0.12.7 (Langmead 
et al. 2009) with default parameters. Normalized gene expression 
estimates were obtained using MMSEQ (Turro et al. 2011) and 
summarized by taking the replicate mean. 

We used a previously published data set consisting of ex- 
pression measurements from 40 diverse mouse tissues to deter- 
mine sets of genes with liver-specific patterns of expression (Su 
et al. 2004). The processed data (Array Express accession: E-MTAB- 
25) was obtained from the Gene Expression Atlas (Kapushesky 
et al. 2010) where up-regulation in a particular tissue with respect 
to the remainder was assessed using a t-test and P < 0.05. 

Motif presence prediction 

For each sequence-specific factor, we trained logistic regression 
classifiers to predict the presence of high-scoring motif matches 
using the ChIP signals (estimated number of ChIP fragments 
overlapping a given CRM) of various factors. Models were trained 
using: (1) ChIP signal of the corresponding factor, and (2) both 
ChIP signal of the corresponding factor and ChIP signals of the 
cohesin subunits (RAD21, STAG1, STAG2). Motif score cut-offs 
corresponding to FDR = 0.4 were chosen to determine high-scoring 
motif match presence/absence (see Supplemental Fig. S10). Ten- 
fold cross-validation was performed using CRMs containing a 
peak for the factor of interest, where 50% of these CRMs were 
randomly selected for the training set and the remaining 50% 
formed the test set. 

Wild-type versus heterozygous Rad21 +, ~ differential 
binding analysis 

Read mapping and filtering for CEBPA and ONECUT1 was carried 
out as described above for both wild-type and heterozygous 
Rad21 +I ~ ChlP-seq data sets, except reads for biological replicates 
were handled separately (technical replicates were pooled). The 
DiffBind package (Ross-Innes et al. 2012) was used with default 
parameters to determine CRMs with significantly lower ChIP sig- 
nal in heterozygous Rad21 +I ~ mouse liver cells versus wild- type 
liver cells (FDR threshold = 0.1). 

Data access 

Data deposited under ArrayExpress accession number E-MTAB-941. 
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